AAA51759 197 aa

apolipoprotein B. 

AAA51759 

AAA51759.1 GI:178822

locus HUMAPOBX accession K03175.1 



linear 



PRI 31-OCT-1994 



Homo sapiens (human) 
Homo sapiens

Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.
REFERENCE 1 (sites) 

AUTHORS Deeb,S.S. and Lebo,R. 
JOURNAL Unpublished (1985) 
REFERENCE 2 (residues 1 to 197) 

AUTHORS Deeb,S.S., Motulsky,A.G. and Albers,J.J.
TITLE A partial cDNA clone for human apolipoprotein B 

JOURNAL Proc. Natl. Acad. Sci. U.S.A. 82 (15), 4983-4986 (1985)
PUBMED 3860836
COMMENT [1] sites; genomic location. 

A draft entry and printed copy of this sequence was kindly provided 
by S.S.Deeb (21-OCT-1985). [1] states that the genomic location of
the gene encoding this mRNA is chromosome 2 p23-p24. 
Method: conceptual translation. 
FEATURES Location/Qualifiers 
source 1..197

/organism="Homo sapiens"
/db_xref="taxon:9606"
/map="2p24-p23"
Protein 1..197

/name="apolipoprotein B"
CDS 1..197

/gene="APOB"
/coded_by="K03175.1:<1..>593"
/codon_start=3
/db_xref="GDB:G00-119-686"



ORIGIN

1 gffpdsvnka lywvngqvpd gvskvlvdhf gytkddkheq dmvngimlsv eklikdlksk
61 evpearaylr ilgeelgfas lhdssswkaa shgcphsagd pqmigevirk gskndfflhy
121 ifmenafelp tgaglqlqis ssgviapgak agvklevanm qaelvakpsv svefvtnmgi
181 iipdfarsgv qmntnff
//
linear PRI 09-NOV-1994



AAB59397 317 aa 

apolipoprotein E. 

AAB59397 

AAB59397.1 GI:178853

locus HUMAPOE4 accession M10065.1 



Homo sapiens (human) 
Homo sapiens 

Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.
1 (residues 1 to 317)

Das,H.K., McPherson,J., Bruns,G.A., Karathanasis,S.K. and
Breslow,J.L.

Isolation, characterization, and mapping to chromosome 19 of the 

human apolipoprotein E gene 

J. Biol. Chem. 260 (10), 6240-6247 (1985) 

3922972

3 (residues 1 to 317) 

Emi,M., Wu,L.L., Robertson,M.A., Myers,R.L., Hegele,R.A.,
Williams,R.R., White,R. and Lalouel,J.M.

Genotyping and sequence analysis of apolipoprotein E isoforms 
Genomics 3 (4), 373-379 (1988) 
3243553

[3] two allelic variations. 

Draft entry and computer-readable sequence for [3] kindly provided
by M.Emi, 19-AUG-1988. 

Apolipoprotein E is a constituent of the human very low density 
lipoprotein in the plasma. There are at least six distinct 
phenotypes derived from the single E gene on chromosome 19; next to 
the epsilon-3 allele (see separate entry), the epsilon-4 allele,
represented by the sequence below, is most common, the product
difference being arginine in place of cysteine at residue 112 [2].
The gene structure of apo E is similar to that of other apo genes:
presence of the 66-bp repeats in the fourth exon (starting at base
3782 below) makes the E gene highly similar to the A-I gene (see
separate entry) as argued by [1].

A potential TATA box is found at positions 1014-1018, and a 
potential polyadenylation signal at 4616-4621. 

[2] and [1] had slight differences in the boundary positions for 
the Alu repeats and their flanks; the boundary positions indicated 
in [1] have been used in the FEATURES table below. Draft entries 
and clean copies were kindly supplied by J.M. Taylor, Gladstone 
Laboratories, San Francisco, and by J. P. Levine, Rockefeller 
University, New York.

Method: conceptual translation. 

Location/Qualifiers 

1..317

/organism="Homo sapiens"
/db_xref="taxon:9606"
/map="19q13.2"
1..317

/product="apolipoprotein E"
1..18
19..317

/product="apolipoprotein E"
1..317

/coded_by="join(M10065.1:1871..1913,M10065.1:3007..3199,
M10065.1:3781..4498)"
/note="precursor"



1 mkvlwaallv tflagcqakv eqavetepep elrqqtewqs gqrwelalgr fwdylrwvqt 
61 lseqvqeell ssqvtqelra lmdetmkelk aykseleeql tpvaeetrar lskelqaaqa
121 rlgadmedvr grlvqyrgev qamlgqstee lrvrlashlr klrkrllrda ddlqkrlavy 
181 qagaregaer glsairerlg plveqgrvra atvgslagqp lqeraqawge rlrarmeemg 
241 srtrdrldev keqvaevrak leeqaqqirl qaeafqarlk swfeplvedm qrqwaglvek 
301 vqaavgtsaa pvpsdnh



ORIGIN 



AAB25217 449 aa linear PRI 28-JUN-1993 

apolipoprotein-J, Apo-J, SP-40,40=plasma glycoprotein/complement
system hemolysis modulator [human, seminal plasma, Peptide, 449
aa].

AAB25217 

AAB25217.1 GI:298237 
accession AAB25217.1 

Homo sapiens (human) 
Homo sapiens 

Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.
1 (residues 1 to 449) 

Choi-Miura,N.H., Takahashi,Y., Nakano,Y., Tobe,T. and Tomita,M.
Identification of the disulfide bonds in human plasma protein
SP-40,40 (apolipoprotein-J)
J. Biochem. 112 (4), 557-561 (1992) 
1491011 

GenBank staff at the National Library of Medicine created this 
entry [NCBI gibbsq 124045] from the original journal article. 
Method: sequenced peptide, ordered by overlap. 
Location/Qualifiers
1..449

/organism="Homo sapiens"
/db_xref="taxon:9606"
1..449

/product="apolipoprotein-J"

/name="plasma glycoprotein/complement system hemolysis 
modulator" 

/note="The beta chain contains amino acid interval from 1 
to 205 and the alpha chain contains amino acid interval 
from 206 to 427; Apo-J; SP-40,40"

1 mmktlllfvg llltwesgqv lgdqtvsdne lqemsnqgsk yvnkeiqnav ngvkqiktli 
61 ektneerktl lsnleeakkk kedalnetre setklkelpg vcnetmmalw eeckpclkqt 
121 cmkfyarvcr sgsglvgrql eeflnqsspf yfwmngdrid sllendrqqt hmldvmqdhf 
181 srassiidel fqdrfftrep qdtyhylpfs lphrrphfff pksrivrslm pfspyeplnf 
241 hamfqpflem iheaqqamdi hfhspafqhp ptefiregdd drtvcreirh nstgclrmkd 
301 qcdkcreils vdcstnnpsq aklrreldes lqvaerltrk ynellksyqw kmlntsslle 
361 qlneqfnwvs rlanltqged qyylrvttva shtsdsdvps gvtevvvklf dsdpitvtvp
421 vevsrknpkf metvaekalq eyrkkhree 
Human apolipoprotein B: analysis of internal
repeats and homology with other apolipoproteins 



Hans De Loof,* Maryvonne Rosseneu,* Chao-Yuh Yang,† Wen-Hsiung Li,††
Antonio M. Gotto, Jr.,† and Lawrence Chan†,**

Department of Clinical Biochemistry,* A. Z. St-Jan, B-8000 Brugge, Belgium; Department of Medicine†
and Department of Cell Biology,** Baylor College of Medicine, Houston, TX 77030; and Center for
Demographic and Population Genetics,†† University of Texas, Houston, TX 77030



Abstract Apolipoprotein B (apoB) is the major protein com- 
ponent of plasma low density lipoproteins (LDL) and, through 
its binding to the LDL receptor, it plays a prominent role in lipo- 
protein metabolism and in the development of atherosclerosis. 
Specially developed computer programs were applied to detect 
potential internal repeats in the human apoB sequence and 
homology of some of these repeats with other apolipoproteins. 
The simultaneous computer alignment of several (repeated) se- 
quences, carried out in an iterative way to generate consensus 
sequences, showed the presence of repeated amphipathic helical 
regions and of repeated hydrophobic proline-rich domains. Ex- 
tensive Monte-Carlo statistics were used to demonstrate the 
statistical significance of the internal repeats. Both classes of 
repeats may contribute to the specific lipid-binding characteristics
of apoB. Additional homology, detected between apoB
and apoE, the other apolipoprotein-ligand of the LDL receptor, 
further defined the structural requirements for this receptor- 
ligand interaction. The computer programs developed in this 
study should also be useful for detecting internal repeats in other 
proteins.-De Loof, H., M. Rosseneu, C-Y. Yang, W-H. Li,
A. M. Gotto, Jr., and L. Chan. Human apolipoprotein B: analysis
of internal repeats and homology with other apolipoproteins.
J. Lipid Res. 1987. 28: 1455-1465.

Supplementary key words lipid-binding • receptor-binding • atherosclerosis



Apolipoprotein (apo) B-100 is the protein ligand in low 
density lipoproteins (LDL) that binds to the LDL recep- 
tor (1). It is thus an important determinant that regulates 
LDL metabolism. Elevated plasma levels of LDL-apoB 
are strongly associated with increased risk of coronary
artery disease (2, 3). Indeed, hyperapoB is a significant 
risk factor for atherosclerosis, even in the presence of nor- 
mal serum cholesterol (4). 

ApoB-100 is the largest protein component in human
lipoproteins. The protein is characterized by its extreme 
insolubility in aqueous media after removal of the lipid, 
by its inability to transfer among lipoprotein particles, 
and by its high molecular weight (5). 



Until recently, little was known of the apoB-100 primary
structure. Attempts at its elucidation were hampered by 
its enormous size, insolubility, and tendency to aggregate. 
Recently, the primary structure of apoB-100 has been 
deduced from its cDNA sequence by four different lab- 
oratories (6-10). ApoB-100 turns out to be the largest 
monomeric protein ever studied. It comprises 4,536 amino
acid residues, with a calculated molecular mass of 512,937 
daltons. 

When apoB-100 is digested with trypsin, the amino acid 
composition of the remaining parts of the protein shows 
little difference from that of the undigested protein (11, 
12). Furthermore, the amino acid composition of the 
carboxyl-terminal fifth of apoB-100 (836 residues) differs 
only slightly from that of the whole molecule (13). These 
observations suggest that apoB-100 contains internal 
repeats. This is interesting because it might explain how 
such an exceptionally large protein has evolved and 
because internal repeats have been found in all the other 
apolipoproteins (14-17). Our initial analysis of the human 
apoB-100 sequence has indeed suggested the presence of 
numerous repeated sequences within this huge protein 
(8). 

However, the existence of internal repeats in apoB-100 
deserves a more careful study because the statistical sig- 
nificance of the potential repeats suggested in our 
previous study was not rigorously established and because 
two other laboratories failed to detect such repeats (7, 9) 
and a third laboratory could identify only uniquely 
repeated sequences of 6-12 residues, scattered throughout 
the apoB sequence (10). 

Here, we have carefully re-analyzed the human apoB-100
sequence using specially developed computer programs
that detect statistically significant repeats within



Abbreviation: LDL, low density lipoprotein. 
extraordinarily long sequences such as apoB-100. The
question of internal repeats in apoB-100 is important not 
only because it provides one mechanism for the evolution 
of such an exceptionally large protein, but more impor- 
tantly, because the different repeats might delineate func- 
tionally important domains within this unique protein. In 
addition, we have analyzed for potential homology be- 
tween apoB-100 and other apolipoproteins. Our results 
provide clues to some common structure-function rela- 
tionship between these evolutionarily related proteins, es- 
pecially with respect to the probable mechanisms of 
lipid-binding, and ligand-receptor interaction of apoB- 
100 and apoE with the LDL receptor. 

MATERIALS AND METHODS 

Comparison matrices 

We analyzed the sequence reported by Yang et al. (8).
Analysis of internal repeats in apoB and of its homology 
with the other apolipoproteins was based on the compar- 
ison-matrix method (18, 19). In this method, all possible 
segments of a given length from one sequence are com- 
pared with all segments of the same length from the other 
sequence using a scoring matrix. The same procedure can 
also be applied to a single sequence to identify and locate 
internal repeated sequences. By using a scoring matrix, 
this procedure allows the detection of segments that are 
not exact duplications, but in which amino acid substitu- 
tions have generally conserved the physicochemical pro- 
perties of the residues. The scoring matrix used is that of 
Staden (20) and is derived from the mutation data matrix 
(PAM 250) devised by Dayhoff, Barker, and Hunt (18). 
All scores exceeding a certain threshold value are plotted 
in a two-dimensional graph with coordinates correspond- 
ing to the center of the compared residue spans. We used 
25- and 40-residue-long segments and calculated the com- 
parison scores between all segments. Each comparison 
score was divided by the segment length in order to obtain 
a mean score. In this way, calculations using different seg- 
ment lengths can be compared. The threshold value was 
set at a mean score of 11, a level such that the well- 
documented internal repeats of apoA-I, A-IV, or E appear 
clearly on the comparison matrices (data not shown). 
Because a comparison matrix of very long sequences con- 
tains a large amount of data, the data were represented in 
a condensed way. The number of scores for the compari- 
son of one span with the rest of the sequence, exceeding 
the threshold value, was plotted in the center of that span. 
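
The segment-by-segment comparison and the condensed count plot described above can be sketched in a few lines of Python. This is an illustrative sketch only: `toy_score` is a hypothetical stand-in for the Staden scoring matrix (20), whose values are not reproduced here, so the threshold in the usage note is tied to the toy scores rather than to the mean score of 11 used in this study. The function names are likewise assumptions made for the example.

```python
from typing import Callable, List, Tuple


def toy_score(a: str, b: str) -> int:
    """Hypothetical scoring function (NOT the Staden matrix): 12 for identity, 8 otherwise."""
    return 12 if a == b else 8


def mean_segment_scores(
    seq_a: str, seq_b: str, span: int, score: Callable[[str, str], int] = toy_score
) -> List[Tuple[int, int, float]]:
    """Compare every span-length segment of seq_a with every one of seq_b.

    Returns (start_a, start_b, mean score), where the mean score is the summed
    residue-pair score divided by the segment length.
    """
    results = []
    for i in range(len(seq_a) - span + 1):
        for j in range(len(seq_b) - span + 1):
            total = sum(score(x, y) for x, y in zip(seq_a[i:i + span], seq_b[j:j + span]))
            results.append((i, j, total / span))
    return results


def condensed_counts(
    seq: str, span: int, threshold: float, score: Callable[[str, str], int] = toy_score
) -> List[int]:
    """Self-comparison: for each segment start, count the other segments whose
    mean score reaches the threshold (the trivial self-match is skipped)."""
    counts = [0] * (len(seq) - span + 1)
    for i, j, mean in mean_segment_scores(seq, seq, span, score):
        if i != j and mean >= threshold:
            counts[i] += 1
    return counts
```

With the toy scores, `condensed_counts("ABCDXXABCD", 4, 12.0)` flags only the two copies of `ABCD`; a real substitution matrix would also reward conservative replacements, which is the point of the method.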

An alternative method of Kubota et al. (21) was applied 
to graphically locate the internal repeats. In this technique,
cross-correlation coefficients are calculated between
25-residue segments. The average of three correlation
coefficients, obtained by using three different



parameters for the 20 amino acids, was plotted in a com- 
parison matrix when the average was higher than 0.475. 
These three parameters used were the conformational 
parameters determined by Levitt (22). They are based on 
the physicochemical characteristics of the amino acids, 
not on their mutability as in the calculations using the 
scoring matrix of Staden (20). No more than three 
parameters were used at a time due to limitations in com- 
puter time and memory. Computations with one para- 
meter such as hydrophobicity (23) or bulkiness (24) were 
also carried out. 
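
A minimal sketch of an averaged cross-correlation between two segments follows. It assumes that each coefficient is an ordinary Pearson correlation between the per-residue property profiles of the two segments. The two property scales are stand-ins chosen for the example (the Kyte-Doolittle hydropathy index and a crude formal charge), not the Levitt parameters (22) used in this study.

```python
import math
from typing import Dict, List, Sequence

# Kyte-Doolittle hydropathy index (stand-in scale, not one of the Levitt parameters).
KD: Dict[str, float] = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Crude formal charge (simplification: histidine is treated as neutral).
CHARGE: Dict[str, float] = {aa: 0.0 for aa in KD}
CHARGE.update({"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0})


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either profile has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def mean_correlation(seg_a: str, seg_b: str, scales: List[Dict[str, float]]) -> float:
    """Average the correlation of two equal-length segments over several scales."""
    if len(seg_a) != len(seg_b):
        raise ValueError("segments must have the same length")
    coefficients = [
        pearson([scale[r] for r in seg_a], [scale[r] for r in seg_b]) for scale in scales
    ]
    return sum(coefficients) / len(coefficients)
```

Plotting every segment pair whose `mean_correlation` exceeds a cut-off (0.475 in the text) would give a comparison matrix of the kind described above.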

A third method was used to generate another type of 
comparison matrix using the programs FASTP and RDF 
of Lipman and Pearson (25). The complete apoB se- 
quence was divided into 200-residue segments starting 
every 100 residues. All non-overlapping segments were 
compared and alignments were optimized. The data ob- 
tained by this comparison procedure were plotted in a 
two-dimensional matrix. 
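
The segmentation step can be illustrated as follows. This sketch only enumerates the segments and the non-overlapping pairs that would be handed to an alignment program; it does not reimplement FASTP or RDF. Coordinates are 0-based and half-open, and the truncation of the final segments at the end of the sequence is an assumption, since the text does not say how the sequence end was handled.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, end), 0-based, end exclusive


def make_segments(length: int, size: int = 200, step: int = 100) -> List[Segment]:
    """Segments of `size` residues starting every `step` residues.

    Assumption: segments running past the sequence end are truncated.
    """
    return [(start, min(start + size, length)) for start in range(0, length, step)]


def non_overlapping_pairs(segments: List[Segment]) -> List[Tuple[Segment, Segment]]:
    """All pairs (a, b) with a before b that share no residue."""
    pairs = []
    for index, a in enumerate(segments):
        for b in segments[index + 1:]:
            if b[0] >= a[1]:
                pairs.append((a, b))
    return pairs
```

For a 4,536-residue sequence, `make_segments(4536)` yields 46 segments under these assumptions, and each pair returned by `non_overlapping_pairs` would be aligned and scored separately.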

Multiple sequence alignment 

Alignment of the repeated sequences in apoB was car- 
ried out using specially developed computer programs. 
For this purpose, all consecutive apoB segments with a 
certain length (a "window" was moved through the se- 
quence) were compared to a query sequence of the same 
length. These sequences were aligned using the Needleman-Wunsch
algorithm (26) and a comparison score was
calculated using the scoring matrix of Staden (20). Multiple
gaps were allowed but penalized (n = number of consecutive
gaps, penalty = 20 + [n - 1] x 5). All non-overlapping
segments exceeding a certain score were saved and
used to generate a new query sequence. This procedure 
was repeated and carried out in an iterative way to 
generate an optimized consensus sequence. Usually after
5-15 cycles, the consensus sequence remained unchanged 
and was then considered as optimized. 
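
The gap penalty and the iterate-until-unchanged loop can be sketched as below. Only the penalty formula comes from the text. The majority-vote rule in `column_consensus` and the caller-supplied `find_hits` function are hypothetical, because the text does not state how a new query sequence was derived from the saved segments.

```python
from collections import Counter
from typing import Callable, List


def gap_penalty(n: int, opening: int = 20, extension: int = 5) -> int:
    """Penalty for a run of n consecutive gaps: 20 + (n - 1) x 5."""
    if n <= 0:
        return 0
    return opening + (n - 1) * extension


def column_consensus(aligned: List[str]) -> str:
    """Assumed rule: most frequent non-gap residue in each aligned column."""
    consensus = []
    for column in zip(*aligned):
        residues = [c for c in column if c != "-"]
        consensus.append(Counter(residues).most_common(1)[0][0] if residues else "-")
    return "".join(consensus)


def refine(query: str, find_hits: Callable[[str], List[str]], max_cycles: int = 15) -> str:
    """Repeat: collect aligned hits for the query, rebuild the consensus,
    and stop once the consensus no longer changes.

    `find_hits` is a hypothetical callback that returns the aligned,
    equal-length segments scoring above the chosen threshold.
    """
    for _ in range(max_cycles):
        hits = find_hits(query)
        if not hits:
            break
        updated = column_consensus(hits)
        if updated == query:
            break
        query = updated
    return query
```

As a worked example of the penalty, a single gap costs 20 and a run of three consecutive gaps costs 20 + 2 x 5 = 30, so extending a gap is cheaper than opening a new one.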

Initial query sequences were chosen as those domains 
of apoB that yielded the highest homology with other 
parts of the sequence or with the other apolipoproteins 
(see Fig. 1, B and D). This procedure was first validated 
by aligning the multiple repeats within the apoA-I se- 
quence. The various parameters (threshold score, mini- 
mal alignment, gap penalty) were selected such that the 
computer-generated alignments were in agreement with 
those obtained by manual alignment (17). We tested this 
procedure with the apoE sequence, a repetitive protein 
whose repeats are not as well defined and are not punc- 
tuated by proline residues. The overall result obtained us- 
ing our program is very similar to the one reported by 
Boguski et al. (19) or Luo et al. (17) (data not shown). 

For the longer consensus sequences ( >25 residues), the 
process was started with shorter segments. When a con- 
sensus was obtained, it was used as the core of longer 



1456 Journal of Lipid Research Volume 28, 1987 



query sequences (+3 residues on both sides). This procedure
was repeated until the number of aligned sequences
or the average alignment score decreased to a minimum. 

Statistical analysis 

In order to validate these consensus sequences and to 
locate the homologous regions throughout the sequence, 
we used the method developed by Kubota et al. (27). 
Average cross-correlation coefficients, based on ten 
parameters, were calculated for all consecutive segments 
of apoB, compared with the consensus sequences. Cor- 
relation coefficients greater than 0.3 were considered to be 
significant as originally proposed by Kubota et al. (27). 
This method has been successfully used to detect the in- 
ternal repeats in rat apoA-IV (19). 

The use of a computer-algorithm for the alignment of 
several sequences enables quantification of the statistical 
significance of the repeated sequences by the Monte- 
Carlo technique (18). The distribution of scores in the 
Dayhoff matrix and homologies with the consensus se- 
quence for randomized sequences was obtained by 
simulation. The random sequences used have the same 
length and the same amino acid composition as the se- 
quence under study. The probability of obtaining a 
number of scores equal to or larger than a given value X 
is expressed as the number of standard deviations (SD 
value) from the mean value of the randomized sequences 
to the actual number of segments equal to or larger than 
X (18). Randomizations were carried out 100 times. 
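
The SD value described above can be expressed compactly. In this sketch `count_hits` is a hypothetical callback that returns the number of segment comparisons with a score equal to or larger than X for a given sequence. Shuffling keeps the length and amino acid composition fixed, as in the text. The use of the population standard deviation is an assumption, since the text does not specify which estimator was used.

```python
import random
import statistics
from typing import Callable, List


def shuffled(seq: str, rng: random.Random) -> str:
    """Random permutation of seq: same length, same amino acid composition."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def sd_value(observed: float, randomized: List[float]) -> float:
    """Distance of the observed count from the mean of the randomized counts,
    in units of their standard deviation (population SD assumed)."""
    mean = statistics.mean(randomized)
    sd = statistics.pstdev(randomized)
    if sd == 0:
        return 0.0 if observed == mean else float("inf")
    return (observed - mean) / sd


def monte_carlo_sd(
    seq: str, count_hits: Callable[[str], int], trials: int = 100, seed: int = 0
) -> float:
    """Compare the real sequence with `trials` shuffled versions of itself."""
    rng = random.Random(seed)
    observed = count_hits(seq)
    randomized = [count_hits(shuffled(seq, rng)) for _ in range(trials)]
    return sd_value(observed, randomized)
```

For instance, if the randomized sequences gave counts of 2, 4, 4, 4, 5, 5, 7, and 9 (mean 5, SD 2) and the real sequence gave 10, the SD value would be (10 - 5) / 2 = 2.5.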

Control calculations were carried out by using the dis- 
tance matrix developed by Bacon and Anderson (28), 
which is based on the Euclidean distances derived from a 
number of physicochemical characteristics of the 20 
amino acids. These computations ruled out the possibility 
that the high values for cysteine, tryptophan, or proline in 
the Staden distance matrix (18) contribute to the statis- 
tical significance of the repeats. Validation of this method 
was carried out using two non-repetitive proteins: β-globin
(human) and serum retinol-binding protein, a
lipid-binding protein. Positive controls were apoA-I and
apoE. Randomizations in the RDF program (25) were 
carried out 1,000 times. 

Analysis of secondary structure potential 

The secondary structure of the repeated sequences was 
analyzed by the methods of Chou and Fasman (29) and 
Garnier, Osguthorpe, and Robson (30). Helical hydrophobic
moments were calculated as previously described
(31). 

RESULTS 

The comparison matrix of apoB with itself is shown in 
Fig. 1A. Repeats are indicated as relatively short diagonals,
parallel to the main diagonal. These computations
were carried out using 25-residue segments (upper
right half of the figure) and 40-residue segments (lower 
left half). The plot consists only of segments with a mean 
comparison score equal to or higher than 11. As shorter 
segments have a higher probability of chance similarity 
than longer segments (see below), the background is 
higher with the 25-residue segments. A close inspection 
shows that in both parts of the figure, the same regions of 
apoB contain clusters of diagonals, indicative of internal 
repeats. The comparison scores for each of the 25-residue 
segments were generally only moderately higher than 11; 
all were below 12.9. For comparison, the maximum score 
for the segments in apoA-I, a protein known to have 
numerous internal repeats, was 12.4. 

In order to estimate the statistical significance of these 
internal repeats, we compared the distribution of scores 
obtained for the real sequence with those of the random- 
ized sequences (Fig. 2, A and B). For comparison, the 
cumulative probability plots for apoA-I, apoE, β-globin,
and serum retinol binding protein are presented in Fig. 
2, C-F. A clear shoulder can be seen in the cumulative 
probability distribution of the scores for apoB and for 
both apoA-I and apoE. The SD value for apoB reaches a 
value of 24 for segments with a score higher than or equal 
to 11. This SD value increases at higher scores, up to 139 
for the 34 comparisons with a score of 12.5 or more. The 
use of an independent scoring matrix of Bacon and An- 
derson (28) confirmed the statistical significance of the 
observation (data not shown). The distributions for apoA- 
I and apoE both show marked differences between the ac- 
tual and the randomized sequences indicative of the 
presence of internal repeats. The two overlapping distri- 
butions for the negative control proteins are also clear 
(Fig. 2, E-F). Differences in the latter case are always 
smaller than 3 SD units. 

Since all the other apolipoproteins are believed to have 
a common evolutionary origin and share common repeats 
(17, 19, 32-34), comparison matrices were constructed 
between apoB and the apolipoproteins A-I, A-II, A-IV, E, 
C-I, C-II, and C-III (Fig. 1C). These comparison 
matrices show that homology exists between some regions 
of apoB and the other apolipoproteins. The total number 
of segments in the small apolipoproteins exceeding the 
threshold value in the comparison with one segment of 
apoB were also plotted (Fig. 1D). This line plot shows that
the homology is located within certain distinct domains of 
the apoB protein. Two large domains that contain most of 
the homologous sequences are located between residues 
2035-2506 and between residues 4002-4527. Some 
smaller domains, as detailed in the legend of Fig. 1, were 
also detected. 

An estimation of the chance occurrence of these homo- 
logies was obtained by randomization of the small apoli- 
poprotein sequences followed by a comparison with the 
Fig. 1. (A) Sequence comparison matrix of apoB with itself. Two different segment lengths were used. In the upper right part of the figure, 25-residue
segments, and in the lower left part, 40-residue segments are compared. Comparison scores, calculated using the matrix proposed by Staden
(20), exceeding a mean score of 11 are plotted as dots with coordinates corresponding to the centers of the segments for the 25-residue segments and
at residue 20 for the 40-residue-long segments. (B) Line plot showing the number of segments of the complete protein homologous to a particular
25-residue-long segment, with a mean score exceeding the threshold value of 11. These numbers are plotted as a function of the center of these segments.
(C) Comparison matrix of apoB with human apoA-I, A-II, A-IV, E, C-I, C-II, and C-III calculated as in Fig. 1A. Sequences are taken from Mahley
et al. (37) and Boguski et al. (19). The segment length is 25. (D) Line plot showing the number of segments within the different small apolipoproteins
having homology with one segment of apoB using 25-residue spans. Two large domains that contain most of the homology are situated between
residues 2035-2506 and between residues 4002-4527. Homology in these domains is interrupted by several nearly blank zones, e.g., between residues
2200-2230, 4095-4126, and 4163-4240. Additional smaller peaks are situated around residues 332, 484, 7*3, 1174, 1643, 1784, and 3623. (E) Line plot
showing the domains within the different small apolipoproteins that show homology with apoB. Homology seems to be of the same order of magnitude
for most of the apolipoproteins. Detailed analysis of the plot for apoE, for example, shows that homology is nearly interrupted between residues
160-215. This corresponds to the central divergent zone of exon 4 in apoE as described by Boguski et al. (19).



complete apoB sequence. Although the SD values were
lower than those for the internal repeats of apoB, they 
were statistically significant. The number of scores ex- 
ceeding the threshold value of 11 yielded SD values of 6.9, 
5.3, 10.5, and 8.5 for apoA-I, A-II, A-IV, and E, respec- 
tively. 

Fig. 1E shows that the homology of apoB with each of
the other apolipoproteins is of approximately the same 
order of magnitude and that the homology is mainly 
located in the domains with potential amphipathic 
helices. A more detailed analysis, for example with apoE, 
shows that homology with apoB is interrupted by a nearly 
blank zone extending from residues 160-215. This corres- 
ponds to the central divergent zone of exon 4 as described 
by Boguski et al. (19), the domain showing the lowest 
homology with apoA-IV or with the other amphipathic 
repeats within apoE. 

The identification of the apoB domains with homology 
to the other apolipoproteins is useful for the interpretation
of the comparison plot of apoB with itself. This is obvious
from the analysis of the apoB domains that are homologous
to 15 segments or more (Fig. 1B), but not homologous
with the other apolipoproteins (Fig. 1D). These domains
correspond to some of the multiple proline-associated regions
identified by Knott et al. (7) and are located in the
segments containing residues 1132-1397, 2528-2760, 
3120-3300, and 3620-3875. The comparison plots using 
the cross correlation coefficients show a similar pattern as 
shown in Fig. 3A. Similar plots, using only one 
parameter such as hydrophobicity (23), also allowed the 
localization of these two classes of repeats (data not 
shown). This finding indicates that neither the computa- 
tion method nor the initial data set influenced the final 
results. 

The presence of internal repeats within apoB is also 
confirmed by analysis using the FASTP program (25). 
The results of these comparisons, plotted in Fig. 3B,
show the presence of many long repeats and are also
in agreement with the conclusions from the other methods.

In contrast to apoA-I and apoA-IV where the repeats 
Fig. 2. Cumulative probability (LN probability) distribution of scores in the comparison matrices of real sequences
(x) and randomized sequences (O) of apoB with itself using 25-residue segments (A) and 40-residue segments
(B). Cumulative probability distribution of the comparison of apoA-I (C), apoE (D), β-globin (E), and serum
retinol binding protein (F) with itself using 25-residue segments. For apoB, apoA-I, and apoE, a shoulder can be
seen indicative of the non-randomness of the occurrence of repeats. Sequences were taken from reference 36 and
the NBRF data bank.



are delineated by proline residues, no sharp boundaries
are found between the repeats in apoB and, as repeat spacing is
variable, alignments were performed by computer procedures.
Iterative multiple alignments using the Needleman-Wunsch
algorithm (26) were carried out with different
query sequences and resulted in the optimal alignment of 
the two classes of repeats with the consensus sequences as 
shown in Table 1 and Table 2. In agreement with the com- 
parison matrix (Fig. 1), the aligned segments belong to 
distinct domains of the apoB sequence. The proline-rich
repeats, homologous to the first 52-residue-long consensus
sequence, occur in sequences starting at residues 1283,
2574, 2666, 3245, 3711, and 3805. All have a mean com- 
parison score greater than 11. Moreover, the correlograms 
confirm the location and statistical significance of these 
consensus sequences with the complete sequence (Fig. 4). 
Four major peaks, starting at residues 1283, 2666, 3245, 
and 3805 have correlation coefficients exceeding 0.39, well 
above the confidence limit of 0.3 used by Kubota et al. 
(27). They are located within the four proline-rich do- 



mains described above. In addition to these four major 
peaks, smaller ones, starting at residues 1278, 2567, 2606, 
2646, 2661, 2671, and 3366 have correlation coefficients 
around 0.3. This suggests that smaller repeats may build 
up the larger ones as is evidenced by the correlogram of 
25-residue-long consensus sequences. Peaks with correlation
coefficients greater than 0.45 are identified as sequences
starting at residues 1289, 1296, 2585, 2704, 3219,
3251, 3717, and 3823. The aligned sequences (scores 
> 11.5) start at residues 1296, 2585, 2629, 2704, 3219, 
3251, 3717, and 3823. 

Extensive Monte-Carlo simulations (18) were performed 
to show the statistical significance of the repeats and to 
eliminate the possibility that the consensus sequences 
contained a bias, due to the pooling together of certain 
residues without a specific sequence. Randomized con- 
sensus sequences were compared with the actual apoB se- 
quence preserving the amino acid composition of the 
various domains. None of the 100 randomizations of the 
52-residue-long sequence yielded a single alignment with 
Fig. 3. Comparison matrices using alternative methods. (A) Left: Comparison matrix of apoB with itself obtained by the method developed by
Kubota et al. (21) using the conformational parameters of Levitt (22). Correlation coefficients exceeding 0.475 are plotted. The clusters of diagonals
are evident in the same regions of apoB as shown in Fig. 2A. (B) Right: Comparison of 200-residue fragments of apoB with all other non-overlapping
200-residue fragments using the FASTP program (ktup = 1); (+) indicates a score equal to or higher than 60; (x) indicates alignments equal to or
longer than 90 residues; and (O) combines the latter two criteria. Most of the comparisons indicated by a circle have SD values, determined by the
RDF program (25) using 1000 randomizations, of three or more.



a mean score equal to or greater than 11. The inverse ap- 
proach of randomizing the apoB sequence also failed to 
yield such alignments. The seven 25-residue-long consen- 
sus sequences with a mean score of 11.83 or higher reached 
an SD value of 40. The FASTP and RDF program of Lip- 
man and Pearson (25) applied to these consensus se- 
quences compared to different 1500-residue segments of 
apoB yielded several single optimal alignments with SD 
values between 9.5 and 12.3 for the first consensus se- 
quence, and SD values between 5.6 and 7.9 for the 25- 
residue consensus sequence. 

The second class of internal repeats, together with the 
22-residue consensus, is shown in Table 2. Analysis of 
the consensus sequence by predictive algorithms and its 
presentation in an Edmundson-wheel diagram (35) (Fig. 
5A) both indicate that it has an amphipathic structure. 
Hydrophobic amino acids are located on one side of the 
helix while the polar residues are located on the other 
side. However, as pointed out by Boguski et al. (19), con- 
sensus sequences are only approximations of real se- 
quences. In order to examine whether each of the internal 
repeats identified in Table 2 is amphipathic in nature, we 
have calculated the mean hydrophobicity and the helical 
hydrophobic moment of each sequence individually. Data 
(not shown) indicate that all the sequences have a high 
helical hydrophobic moment consistent with an am- 



phipathic structure. Furthermore, presentation of the se- 
quences in Edmundson-wheel diagrams clearly indicates 
that indeed each of them forms an amphipathic helix (data 
not shown). 
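
For readers unfamiliar with the quantity, the mean hydrophobicity and helical hydrophobic moment of a segment can be computed as sketched below. This follows the widely used vector-sum definition with 100 degrees between successive residues of an alpha helix, and uses the Kyte-Doolittle hydropathy index as a stand-in scale. Both are assumptions: the text cites reference (31) for the procedure actually used and does not restate it.

```python
import math
from typing import Dict

# Kyte-Doolittle hydropathy index (stand-in; the scale used in the study is not restated).
KD: Dict[str, float] = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydrophobicity(seq: str, scale: Dict[str, float] = KD) -> float:
    """Average hydrophobicity per residue."""
    return sum(scale[r] for r in seq) / len(seq)


def hydrophobic_moment(seq: str, scale: Dict[str, float] = KD, angle_deg: float = 100.0) -> float:
    """Per-residue helical hydrophobic moment.

    Each residue contributes a vector of length H_n pointing at angle n * delta
    around the helix axis; the moment is the length of the vector sum divided
    by the number of residues.
    """
    delta = math.radians(angle_deg)
    sin_sum = sum(scale[r] * math.sin(i * delta) for i, r in enumerate(seq))
    cos_sum = sum(scale[r] * math.cos(i * delta) for i, r in enumerate(seq))
    return math.hypot(sin_sum, cos_sum) / len(seq)
```

A uniformly hydrophobic stretch such as 18 leucines (exactly five helical turns) has a high mean hydrophobicity but a moment of essentially zero, whereas a segment that alternates leucine and lysine with the helical period, for example `LKKLLKKLLKKLLKKLLK`, has a large moment. That contrast is what distinguishes an amphipathic helix from a merely hydrophobic one.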

Alignments starting at residues 2079, 2135, 2173, 2384, 
2407, 4150, 4237, 4397, and 4463 have a mean score ex- 
ceeding 11.5. On the correlogram, the highest peaks are 
also concentrated in the two domains described above; 
peaks with a score higher than 0.4 start at residues 2079, 
2173, 2384, 2407, 2500, 2507, 4038, 4150, and 4237. Sta- 
tistical significance, calculated in the same way as for the 
"proline-rich" repeats, yielded an SD value of 21.3 for the 
nine repeats with a score higher than or equal to 11.59. 

A correlogram of this consensus sequence with the hu- 
man apoA-IV sequence revealed the tandemly organized 
repeat structure (19) of apoA-IV (Fig. 4D), confirming the 
amphipathic helical characteristics of this consensus se- 
quence. A final test using the RDF program (25) shows 
that only certain domains yield optimized alignments 
with SD values higher than 3 (Fig. 4E). 
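
The idea behind a correlogram can be illustrated as follows. This is a simplified sketch and not the method of Kubota et al. (27): it assumes a single two-state property (hydrophobic versus non-hydrophobic, a coding chosen here purely for illustration) and plots the Pearson correlation between the property profile of the consensus and that of each window of the long sequence. All names are hypothetical.

```python
import math

# Assumed two-state coding, for illustration only.
HYDROPHOBIC = set("AVLIMFWYC")


def encode(seq):
    """Turn a sequence into a numeric property profile."""
    return [1.0 if aa in HYDROPHOBIC else 0.0 for aa in seq]


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either profile is flat."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlogram(sequence, consensus):
    """Correlation of the consensus profile with every window of sequence.

    Element i of the result is the score of the window starting at
    0-based position i.
    """
    probe = encode(consensus)
    profile = encode(sequence)
    w = len(probe)
    return [pearson(profile[i:i + w], probe)
            for i in range(len(profile) - w + 1)]
```

Peaks in the returned list mark windows whose property pattern follows that of the consensus; a threshold such as the 0.4 used above would then be applied to the peaks.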

Knott et al. (36) reported some homology of residues 
140-150 in apoE with residues 3357-3367 in apoB, 
although they also reported that there was no significant 
homology between apoB and any apolipoprotein, including 
apoE (7, 36). Based on our homology calculations, the 
receptor-binding domains for apoB and apoE seem to in- 
TABLE 1. "Proline-rich" consensus sequences 

First 
Residue    Sequence                                                 Score 

A. 52 Residues 
1283       LKMLETVRTPALHFKSVGFHLPSREFQVPTFTIPKLYQLQVP-LLGVLDLSTN    11.88 
2574       EVSLQALQKATFQTPDFIVPLTDLRIPSVQINFKDLKNIKIPSRFSTPEFTI-    11.31 
2666       LRDLKVEDIPLARITLPDFRLPEIAIPEFIIPTLNLNDFQVP-DLHIPEFQLP    1 0 oo 
3245       YVFPKAVSMPSFSILGSDVRVPSYTLILPSLELPVLHVPRNL-KLSLPDFKEL    1 1 .yo 
3711       NDLNSVLVMPTFHVPFTDLQVPSCKLDFREIYKKLRTSSF-ALNLPTLPEV      11.40 
3805       SDGIAALDLNAVANKIADFELPTIIVPEQTIEIPSIKFSVPA-GIAIPSFQAL    12.26 
Consensus  LDSLKALDMPTFHIPSSDFRLPSITIPEPTIEIPKLKNSQVP-ALSIPDFQJX 

B. 25 Residues 
1296       FKSYGFHLPSREFQVPTFTIPKLYQ                                11.84 
2585       FQTPDFIVPLTDLRIPSVQINFKDL                                12.76 
2629       FHIPSFTIDFVEMKVKIIRTIDQML                                11.68 
2704       FQVPDLHIPEFQLPHISHTIEVPTF                                12.72 
3219       FQIPGYTVPVVNVEVSPFTIEMSAF                                12.80 
3251       VSMPSFSILGSDVRVPSYTLILPSL                                12.24 
3717       LVMPTFHVPFTDLQVPSCKLDFREI                                12.54 
3823       FELPTIIVPEQTIEIPSIKFSVPAG                                12.64 
Consensus  FQMPSFHVPETDLEVPSITIEVPAL 

Fifty-two- and 25-residue-long consensus sequences derived by the iterative alignment procedure. Identical residues are printed in boldface type 
and related amino acids are underlined. The mean scores are calculated using the Staden (20) scoring matrix. Gaps are penalized as described in Methods. 



clude more residues than originally proposed (36, 37). 
The comparison of the two domains yields additional in- 
teresting information about the structural requirements 
for receptor-protein interaction. The homologous resi- 
dues of the two sequences were plotted in an Edmundson- 
wheel diagram (35) (Fig. 5B) which shows that most 
residues match and that the amphipathic nature of the 
two segments is very well preserved (38). 
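
The geometry of such a wheel diagram is simple to reproduce. The sketch below assumes an ideal alpha-helix with 100 degrees between successive residues; the function names are hypothetical and positions are numbered from 1 within the segment being plotted.

```python
def wheel_angle(position, step_deg=100):
    """Angle (degrees) of a 1-based residue position on the wheel."""
    return ((position - 1) * step_deg) % 360


def wheel_order(n_residues, step_deg=100):
    """Residue positions listed in order of increasing wheel angle."""
    positions = range(1, n_residues + 1)
    return sorted(positions, key=lambda p: wheel_angle(p, step_deg))


def angular_separation(pos_a, pos_b, step_deg=100):
    """Smallest angle (degrees) between two residues on the wheel."""
    diff = abs(wheel_angle(pos_a, step_deg) - wheel_angle(pos_b, step_deg))
    return min(diff, 360 - diff)


if __name__ == "__main__":
    print(wheel_order(18))
    # Positions 8 and 19 of a segment numbered from apoE residue 136
    # correspond to residues 143 and 154.
    print(angular_separation(8, 19))
```

Two residues separated by a small angle lie on the same face of the helix. With the segment numbered from residue 136 of apoE, residues 143 and 154 occupy positions 8 and 19 and are only 20 degrees apart on the wheel.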

The general similarities in structure notwithstanding, 
interesting differences were also observed. Residues Lys- 
143 and Asp-154 of apoE are located on the same side of 
the helix and have opposite charges. Theoretically, they 
may neutralize each other and reduce their importance in 
the polar interactions with the LDL receptor. In apoB, the 
two corresponding residues, Leu-3360 and Ala-3371, are 
not charged, thus maintaining a neutral environment for 
receptor interaction. Furthermore, residue Arg-136 in 
apoE and residue Lys-3353 in apoB are both positively 
charged. When the receptor-binding domains of human 
(37), rat (39) and mouse (40) apoE are aligned with that 
of apoB, a positively charged amino acid is again present 
at this position (Fig. 5B). This would extend the residues 
potentially important for receptor-binding in apoE to 
136-160 which is compatible with the region identified by 
hydrophobicity analysis (31). The significance of this re- 
gion of homology was tested by comparing the 20-residue 



segment of apoB (3352-3371) with the whole apoE se- 
quence. Using 1,000 randomizations, we obtained a z- 
score of 7.84. The similarity of these protein domains 
does not go beyond these 20 residues because the statis- 
tical significance decreased markedly when more residues 
were aligned. 
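
The logic of such a randomization test can be sketched as follows. This is not the RDF program (25): the sketch assumes a plain identity count over ungapped windows in place of a scoring matrix with gap penalties, and it assumes that the longer sequence is the one being shuffled. The function names and the default seed are hypothetical.

```python
import math
import random
from statistics import mean, pstdev


def best_ungapped_score(query, target):
    """Highest number of identities between query and any target window.

    Assumes len(target) >= len(query).
    """
    w = len(query)
    return max(
        sum(q == t for q, t in zip(query, target[i:i + w]))
        for i in range(len(target) - w + 1)
    )


def sd_value(query, target, n_shuffles=1000, seed=0):
    """Observed score expressed in SD units above the shuffled mean."""
    rng = random.Random(seed)
    observed = best_ungapped_score(query, target)
    residues = list(target)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # keeps composition, destroys order
        null_scores.append(best_ungapped_score(query, "".join(residues)))
    mu, sd = mean(null_scores), pstdev(null_scores)
    if sd == 0.0:
        return 0.0 if observed == mu else math.inf
    return (observed - mu) / sd
```

Shuffling preserves the amino acid composition of the target, so a high value reflects the order of the residues rather than a shared bias in composition.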



DISCUSSION 



Because of the importance of apoB in the development 
of atherosclerosis and in genetically determined hyperlipi- 
demias, and because of the unique nature of its lipid- 
binding properties, the elucidation of the apoB structure 
has been a subject of intense research and a source of frus- 
tration for numerous investigators. The sequence of this 
protein is of special interest as apoB seems to be a highly 
polymorphic protein (41) and mutations in some func- 
tionally important domains may interfere with specific 
LDL-apoB functions such as receptor binding. Studies on 
this clinically important protein will provide insight into 
the structure-function relationship of its various subdo- 
mains in analogy with studies on apoE and with the LDL 
receptor (1). 
Fig. 4. Correlograms for sequence comparison according to Kubota et 
al. (27). (A) Correlogram of the total apoB sequence with the 52-residue 
"proline-rich" consensus sequence. (B) Correlogram as in Fig. 4A with 
the 25-residue-long consensus sequence. (C) Correlogram with the 22- 
residue consensus sequence. (D) Correlogram of the 22-residue consen- 
sus sequence with the human apoA-IV sequence. (E) Comparison of the 
22-residue consensus sequence with 200-residue fragments of apoB us- 
ing the RDF program. The SD values for the optimized scores obtained 
after 1000 randomizations (ktup = 1) are plotted as a function of the mid- 
point of the 200-residue fragments. The location of amphipathic helical 
repeats within the two zones, as described in the text, is clearly visible 
as zones with scores exceeding the statistical significance limit of 3 SD 
values. 

Considering the importance of this sequence, we have 
analyzed it for the existence of internal repeats and of 
homologies with other apolipoproteins, hoping to gain in- 
sight into its structure-function relationship. 



Studies in other laboratories failed to identify internal 
repeated sequences or homology with other apolipopro- 
teins (7, 9, 19). One study suggested that similarity to 
other apolipoproteins is restricted to a single 11-residue 
segment shared by apoE and apoB, which lies within the 
putative LDL receptor-binding domain of both proteins 
(36). The negative conclusions are probably due to the ex- 
treme length of this sequence, making computations very 
time-consuming, to the different methods employed 
(methods generally developed for shorter sequences with 
highly conserved internal repeats), and to the fact that no 
repeats or homologous domains have very high similarity 
scores. In our previous study (8), we suggested that many 
potential internal repeats exist in apoB. The analysis, 
however, was not rigorous because we had not performed 
any randomization tests to see how often segments with 
high similarities can arise by chance. The present analysis 
is based on computer programs developed specially to de- 
tect internal repeats within long sequences at a lower 
threshold. Special attention was paid to the statistical sig- 
nificance of the apparent homologous regions identified. 
An iterative procedure was applied to search for an op- 
timal consensus sequence for the apoB repeats detected 
on the comparison matrix. 
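
The principle of an iterative consensus search can be illustrated with a deliberately simplified sketch. It is not the procedure described in Methods: it assumes identity scoring instead of a scoring matrix, ignores gaps during the window search, and does not prevent overlapping windows from being selected. All function names are hypothetical.

```python
from collections import Counter


def column_consensus(aligned):
    """Most frequent residue in each column of equal-length segments."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("segments must have equal length")
    consensus = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c != "-")
        consensus.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(consensus)


def identity(a, b):
    """Number of identical positions between two segments."""
    return sum(x == y for x, y in zip(a, b))


def refine_consensus(sequence, seed, n_repeats, max_iter=20):
    """Re-derive the consensus from its best-matching windows until stable."""
    w = len(seed)
    windows = [sequence[i:i + w] for i in range(len(sequence) - w + 1)]
    consensus = seed
    for _ in range(max_iter):
        best = sorted(windows, key=lambda s: identity(s, consensus),
                      reverse=True)[:n_repeats]
        updated = column_consensus(best)
        if updated == consensus:
            break
        consensus = updated
    return consensus
```

Starting from any one repeat as the seed, each pass collects the windows that best match the current consensus and recomputes the consensus from them, stopping when it no longer changes.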

Our computation procedure has identified the am- 
phipathic helical segments (38) in apoB that are homolo- 
gous to those in other apolipoproteins. It has 
demonstrated significant homology between putative 
receptor-binding domains in apoB and apoE, and has fur- 
ther revealed the existence of "proline-rich" repeats 
characteristic for apoB. 

The homologous repeats that are common to apoB and 
the other apolipoproteins show characteristic amphipathic 
helices (Fig. 5A). These amphipathic helices have been 
extensively investigated by different groups using a varie- 
ty of methods (38, 42-45), and are thought to be impor- 
tant for phospholipid binding. Interestingly, on intact 
LDL particles, the domains containing these repeats are 
generally inaccessible to trypsin (8), further supporting 
the hypothesis that these domains are involved in lipid 
binding. 

In contrast, the "proline-rich" repeats unique to apoB 
are characterized by the preponderance of hydrophobic 
residues. Their secondary structure is predicted (29, 30) 
to be composed predominantly of β-sheets and β-turns 
(due to proline residues). We speculate that they interact 
with lipids in a different way. Computer modeling (Bras- 
seur, R., H. De Loof, M. Rosseneu, and J-M. Ruysschaert, 
unpublished data) of these proline-rich sequences in the 
presence of dipalmitoylphosphatidylcholine suggests that 
the first part of such a segment consists of a β-sheet that 
might penetrate into the acyl chains. After a turn around 
a proline residue, the segment can form a second β-sheet 
parallel to the first one, but with a reverse orientation. 
The relative symmetry of these structures can account for 
TABLE 2. Twenty-two residue consensus 

First                                                                Mean            Mean Helical 
Residue    Sequence                                       Score      Hydrophobicity  Hydrophobic Moment 

2079       /"\U\ ZD WD A A ¥ P If! DHA a Wnvi KI                     -0.20           0.98 
2135       JJAIv 1 INF W HA. L, o ^£L.^<J J I XVI 1 K^t   11.63      -0.08           0.89 
2173       NIIDEIIEKLKSLDEHYHIRVN                         12.09      -0.06           1.02 
2384       TFIEDVNKFLDMLIKKLKSFDY                         11.59       0.06           1.01 
2407       QFVDETNDKIREVTQRLNGEIQ                         12.31      -0.32           1.02 
4150       RVTQEFHMKVKHLIDSLIDFLN                         12.68       0.03           1.00 
4237       DVISMYRELLKDLSKEAQEVFK                         11.95      -0.12           0.99 
4397       EYIVSASNFTSQLSSQVEQFLH                         11.85       0.13           0.69 
4463       DYHQQFRYKLQDFSDQLSDYYE                         12.90      -0.31           0.82 
Consensus  DFIDEFNEKLKDLSDQLNDFLN                                    -0.13           0.98 

Twenty-two-residue-long consensus sequence derived by the iterative alignment procedure. This consensus sequence is shown in an Edmundson 
wheel (35) representation in Fig. 5A. Identical residues are printed in boldface type and related amino acids are underlined. The mean hydrophobicity 
and mean helical hydrophobic moment are calculated for all segments as previously described (31). The mean hydrophobicity of the different segments 
is close to zero. The mean helical hydrophobic moment is, however, close to unity for most segments. This is indicative of the amphipathic nature 
of these segments when oriented in a helical conformation. The presentation of the individual segments in an Edmundson-wheel diagram confirms 
that each segment can form an amphipathic helix. 



the partial overlapping of these repeats. Such a structure 
would be able to penetrate more deeply into the LDL 
than the amphipathic helices. 

Cooperativity in the lipid-binding of these two different 
classes of subdomains, i.e., amphipathic versus proline- 
rich regions, might account for the observation that, in 
contrast to the smaller apolipoproteins, apoB does not ex- 
change between different lipoprotein particles (5). Never- 
theless, it is noteworthy that there are no segments within 
the apoB sequence with a hydrophobicity comparable to 



the membrane-spanning segments of integral membrane 
proteins (6). 

The genomic structure of apoB has recently been 
reported by Blackhart et al. (46). The intron-exon junc- 
tions do not clearly define the relative positions of the 
various internal repeats, except that the last intron-exon 
boundary, occurring after residue 4002, delineates the 
large COOH-terminal domain homologous with the 
other apolipoproteins. 

In conclusion, using a combination of computation 
Fig. 5B wheel labels, listed by position in the segment (position, human apoB-human apoE/rat apoE, mouse apoE): 

 1 K-R/RR    2 L-L/LL    3 E-A/SS    4 G-S/TT    5 T-H/HH 
 6 T-L/LL    7 R-R/RR    8 L-K/KK    9 T-L/MM   10 R-R/RR 
11 K-K/KK   12 R-R/RR   13 G-L/LL   14 L-L/MM   15 K-R/RR 
16 L-D/DD   17 A-A/AA   18 T-D/DD   19 A-D/DD   20 L-L/LL 

Fig. 5. (A) Edmundson-wheel diagram (35) of the 22-residue consensus sequence. One side of the helix clearly has highly hydrophobic amino acids 
while non-hydrophobic amino acids are present on the other side of the helix. (B) Edmundson-wheel representation of the region of the apoB sequence 
(3352-3371) homologous to the receptor binding region of human apoE (136-155) (31, 37) and corresponding residues of the rat apoE (39), and mouse 
apoE sequence (residues 128-147) (40). Individual amino acids, labeled from left to right, represent residues in human apoB, human apoE/rat apoE, 
mouse apoE, respectively, e.g., 18 T-D/DD. 
procedures, we have identified the presence of different 
types of internal repeats in apoB-100 and sequence homo- 
logy between this protein and the soluble apolipoproteins. 
The internal repeats share interesting physical properties 
which might bear on the overall physicochemical behavior 
of apoB-100. The methods used in this study can be 
directly applied to the identification of intra- and interse- 
quence homologies among other protein sequences of dif- 
ferent lengths. 